TITLE: Peer controller management in a dual controller fibre channel
storage enclosure



KWIC 



Brief Summary Text - BSTX (5):

Because of the high bandwidth and flexible connectivity provided by the
FC, the FC is becoming a common medium for interconnecting peripheral
devices within multi-peripheral-device enclosures, such as redundant
arrays of inexpensive disks ("RAIDs"), and for connecting
multi-peripheral-device enclosures with one or more host computers.
These multi-peripheral-device enclosures economically provide greatly
increased storage capacities and built-in redundancy that facilitates
mirroring and fail over strategies needed in high-availability systems.
Although the FC is well-suited for this application with regard to
capacity and connectivity, the FC is a serial communications medium.
Malfunctioning peripheral devices and enclosures can, in certain cases,
degrade or disable communications. A need has therefore been recognized
for methods to improve the ability of FC-based multi-peripheral-device
enclosures to isolate and recover from malfunctioning peripheral
devices. A need has also been recognized for additional communications
and component redundancies within multi-peripheral-device enclosures to
facilitate higher levels of fault-tolerance and high-availability.
TITLE: Method and apparatus for providing failure detection and
recovery with predetermined replication style for distributed
applications in a network



KWIC 



Brief Summary Text - BSTX (13):

In accordance with the present invention, an application module running
on a host computer is made reliable by first registering itself for its
own failure and recovery processes. A ReplicaManager daemon process,
running on the same host computer on which the application module is
running or on another host computer connected to the network to which
the application module's machine is connected, receives a registration
message from the application module. This registration message, in
addition to identifying the registering application module and the host
machine on which it is running, includes the particular replication
strategy (cold, warm or hot backup style) and the degree of replication
to be associated with the registered application module, which
registered replication strategy is used by the ReplicaManager to set
the operating state of each backup copy of the application module as
well as to maintain the number of backup copies in accordance with the
degree of replication. A Watchdog daemon process, running on the same
host computer as the registered application module, then periodically
monitors the registered application module to detect failures. When the
Watchdog daemon detects a crash or a hangup of the monitored
application module, it reports the failure to the ReplicaManager, which
in turn effects a fail-over process. Accordingly, if the replication
style is warm or hot and the failed application module cannot be
restarted on its own host computer, one of the running backup copies of
the primary application module is designated as the new primary
application module and a host computer on which an idle copy of the
application module resides is signaled over the network to execute that
idle application. The degree of replication is thus maintained, thereby
assuring protection against multiple failures of that application
module. If the replication style is cold and the failed application
cannot be restarted on its own host computer, then a host computer on
which an idle copy of the application module resides is signaled over
the network to execute the idle copy. In order to detect a failure of a
host computer or the Watchdog daemon running on a host computer, a
SuperWatchDog daemon process, running on the same host computer as the
ReplicaManager, detects inputs from each host computer. Upon a host
computer failure, detected by the SuperWatchDog daemon by the lack of
an input from that host computer, the ReplicaManager is accessed to
determine the application modules that were running on that host
computer. Those application modules are then individually
failure-protected in the manner established and stored in the
ReplicaManager.
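
The fail-over decision described above might be sketched as follows.
This is only an illustrative reading of the summary, not the patented
implementation: the names Style, Registration and fail_over are
hypothetical, and the degree of replication is assumed here to mean the
number of running backup copies.

```python
from dataclasses import dataclass, field
from enum import Enum


class Style(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


@dataclass
class Registration:
    """Registration record kept per module (field names are hypothetical)."""
    module: str
    primary_host: str
    style: Style
    degree: int  # assumption: number of running backup copies to maintain
    running_backups: list = field(default_factory=list)  # hosts with running backups
    idle_hosts: list = field(default_factory=list)       # hosts with idle copies


def fail_over(reg, restart_ok):
    """Return the actions taken once the Watchdog reports a crash or hang."""
    if restart_ok:
        # The failed module can be restarted on its own host computer.
        return [("restart", reg.primary_host)]
    actions = []
    if reg.style is Style.COLD:
        # Cold style: signal a host holding an idle copy to execute it.
        if reg.idle_hosts:
            reg.primary_host = reg.idle_hosts.pop(0)
            actions.append(("execute_idle_copy", reg.primary_host))
        return actions
    # Warm or hot style: designate a running backup as the new primary.
    if reg.running_backups:
        reg.primary_host = reg.running_backups.pop(0)
        actions.append(("promote_backup", reg.primary_host))
    # Start idle copies until the degree of replication is restored.
    while len(reg.running_backups) < reg.degree and reg.idle_hosts:
        host = reg.idle_hosts.pop(0)
        reg.running_backups.append(host)
        actions.append(("execute_idle_copy", host))
    return actions
```

For a warm-style module with one running backup and one idle copy, the
sketch promotes the backup and then starts the idle copy, which matches
the order of events given in the summary.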
TITLE:
Brief Summary Text - BSTX (5):

Because of the high bandwidth and flexible connectivity provided by the
FC, the FC is becoming a common medium for interconnecting peripheral
devices within multi-peripheral-device enclosures, such as redundant
arrays of inexpensive disks ("RAIDs"), and for connecting
multi-peripheral-device enclosures with one or more host computers.
These multi-peripheral-device enclosures economically provide greatly
increased storage capacities and built-in redundancy that facilitates
mirroring and fail over strategies needed in high-availability systems.
Although the FC is well-suited for this application with regard to
capacity and connectivity, the FC is a serial communications medium.
Malfunctioning peripheral devices and enclosures can, in certain cases,
degrade or disable communications. A need has therefore been recognized
for methods to improve the ability of FC-based multi-peripheral-device
enclosures to isolate and recover from malfunctioning peripheral
devices, and for improving the ability of systems including one or more
host computers and multiple, interconnected FC-based
multi-peripheral-device enclosures to isolate and recover from a
malfunctioning multi-peripheral-device enclosure. A need has also been
recognized for additional communications and component redundancies
within multi-peripheral-device enclosures to facilitate higher levels
of fault-tolerance and high-availability.
TITLE: Method and apparatus for providing failure detection and
recovery with predetermined degree of replication for distributed
applications in a network



KWIC 



Brief Summary Text - BSTX (13):

In accordance with the present invention, an application module running
on a host computer is made reliable by first registering itself for its
own failure and recovery processes. A ReplicaManager daemon process,
running on the same host computer on which the application module is
running or on another host computer connected to the network to which
the application module's machine is connected, receives a registration
message from the application module. This registration message, in
addition to identifying the registering application module and the host
machine on which it is running, includes the particular replication
strategy (cold, warm or hot backup style) and the degree of replication
to be associated with the registered application module, which
registered replication strategy is used by the ReplicaManager to set
the operating state of each backup copy of the application module as
well as to maintain the number of backup copies in accordance with the
degree of replication. A Watchdog daemon process, running on the same
host computer as the registered application module, then periodically
monitors the registered application module to detect failures. When the
Watchdog daemon detects a crash or a hangup of the monitored
application module, it reports the failure to the ReplicaManager, which
in turn effects a fail-over process. Accordingly, if the replication
style is warm or hot and the failed application module cannot be
restarted on its own host computer, one of the running backup copies of
the primary application module is designated as the new primary
application module and a host computer on which an idle copy of the
application module resides is signaled over the network to execute that
idle application. The degree of replication is thus maintained, thereby
assuring protection against multiple failures of that application
module. If the replication style is cold and the failed application
cannot be restarted on its own host computer, then a host computer on
which an idle copy of the application module resides is signaled over
the network to execute the idle copy. In order to detect a failure of a
host computer or the Watchdog daemon running on a host computer, a
SuperWatchDog daemon process, running on the same host computer as the
ReplicaManager, detects inputs from each host computer. Upon a host
computer failure, detected by the SuperWatchDog daemon by the lack of
an input from that host computer, the ReplicaManager is accessed to
determine the application modules that were running on that host
computer. Those application modules are then individually
failure-protected in the manner established and stored in the
ReplicaManager.
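
Host-failure detection by "the lack of an input" could be sketched as a
simple heartbeat timeout. The excerpt does not say how long the
SuperWatchDog waits or how it queries the ReplicaManager, so the timeout
value, the class layout and the modules_by_host mapping below are all
assumptions made for illustration.

```python
class SuperWatchDog:
    """Hypothetical heartbeat monitor; not the patent's actual design."""

    def __init__(self, timeout, modules_by_host):
        self.timeout = timeout                  # assumed silence threshold, in seconds
        self.modules_by_host = modules_by_host  # stand-in for the ReplicaManager lookup
        self.last_seen = {}                     # host -> time of its latest input

    def heartbeat(self, host, now):
        # Record an input received from a host computer.
        self.last_seen[host] = now

    def failed_hosts(self, now):
        # A host is treated as failed when it has been silent past the timeout.
        return sorted(h for h, t in self.last_seen.items()
                      if now - t > self.timeout)

    def modules_to_protect(self, now):
        # Look up the application modules that were running on each failed host.
        return {h: self.modules_by_host.get(h, [])
                for h in self.failed_hosts(now)}
```

Each module returned by modules_to_protect would then be handled
individually according to its registered replication style.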
Detailed Description Text - DETX (70):

FIG. 4B shows which of the software modules, described and discussed
above in connection with FIG. 2B, is associated with the processing by
an aware client of a fail-over or fail-back on the network. Fail-over
refers to the response, by aware clients seeking access to a resource,
to the failure of a node, e.g. server, designated in the name driver
module 194 for accessing that resource. Fail-back deals with the
behavior of an aware client in response to a recovery of a node, e.g.
server, on the network from a failed condition. The operation begins,
in a manner similar to that described and discussed above in connection
with FIG. 4A, with the issuance of an I/O request by the application
module 196. That request is passed to the command processing module
192. Since the I/O request is destined for an external resource, the
path to the resource needs to be determined. The request is therefore
passed to the resource management module 186 and to the name driver
module 194 to obtain the path. The command processing module 192 passes
the request with path information to fail-over module 188 for further
processing. Fail-over module 188 then calls the redirector module 184
to send the I/O request via the path obtained from the name driver. If
fail-over module 188 determines that there is a failure, it calls the
name driver module to provide an alternate path for the I/O operation,
and the fail-over module 188 reissues the I/O command with the
alternate path to the redirector module 184. Data passing between the
resource and the application module 196 is passed via the redirector
module 184. Upon failure detection and redirecting by fail-over module
188, name driver module 194 marks the path as failed. Periodically,
name driver module 194 checks the network for the valid presence of the
failed paths and, if good, once again marks them failed-back or valid
so that they may once again be used in the future, if necessary.
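
The fail-over and fail-back flow described above can be sketched
roughly as follows. NameDriver and issue_io are hypothetical names, the
redirector is modelled as a plain callable, and a ConnectionError is
assumed to be how a failure shows up; none of these details come from
the patent text.

```python
class NameDriver:
    """Hypothetical stand-in for name driver module 194."""

    def __init__(self, paths):
        self.paths = list(paths)  # candidate paths, in order of preference
        self.failed = set()       # paths currently marked as failed

    def get_path(self):
        # Return the first path not marked failed, or None if none remain.
        for path in self.paths:
            if path not in self.failed:
                return path
        return None

    def mark_failed(self, path):
        self.failed.add(path)

    def check_failed_paths(self, is_alive):
        # Periodic fail-back check: paths that respond again become valid.
        for path in list(self.failed):
            if is_alive(path):
                self.failed.discard(path)


def issue_io(request, name_driver, redirector):
    """Send the request, reissuing it on an alternate path after a failure."""
    path = name_driver.get_path()
    while path is not None:
        try:
            return redirector(request, path)
        except ConnectionError:
            # Failure detected: mark the path and ask for an alternate.
            name_driver.mark_failed(path)
            path = name_driver.get_path()
    raise ConnectionError("no valid path to resource")
```

Calling check_failed_paths on a timer corresponds to the periodic check
in the text: once a failed server answers again, its path is eligible
for later requests.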
Detailed Description Text - DETX (73):

FIG. 4B shows which of the software modules, described and discussed
above in connection with FIG. 2B, is associated with the processing by
an aware client of a fail-over or fail-back on the network. Fail-over
refers to the response, by aware clients seeking access to a resource,
to the failure of a node, e.g. server, designated in the name driver
module 194 for accessing that resource. Fail-back deals with the
behavior of an aware client in response to a recovery of a node, e.g.
server, on the network from a failed condition. The operation begins,
in a manner similar to that described and discussed above in connection
with FIG. 4A, with the issuance of an I/O request by the application
module 196. That request is passed to the command processing module
192. Since the I/O request is destined for an external resource, the
path to the resource needs to be determined. The request is therefore
passed to the resource management module 186 and to the name driver
module 194 to obtain the path. The command processing module 192 passes
the request with path information to fail-over module 188 for further
processing. Fail-over module 188 then calls the redirector module 184
to send the I/O request via the path obtained from the name driver. If
fail-over module 188 determines that there is a failure, it calls the
name driver module to provide an alternate path for the I/O operation,
and the fail-over module 188 reissues the I/O command with the
alternate path to the redirector module 184. Data passing between the
resource and the application module 196 is passed via the redirector
module 184. Upon failure detection and redirecting by fail-over module
188, name driver module 194 marks the path as failed. Periodically,
name driver module 194 checks the network for the valid presence of the
failed paths and, if good, once again marks them failed-back or valid
so that they may once again be used in the future, if necessary.
Detailed Description Text - DETX (102):

FIG. 4B shows which of the software modules described and discussed
above in connection with FIG. 2B is associated with the processing by
an aware client of a fail-over or fail-back on the network. Fail-over
refers to the response by aware clients seeking access to a resource to
the failure of a node, e.g. server, designated in the name driver
module 194 for accessing that resource. Fail-back deals with the
behavior of an aware client in response to a recovery of a node, e.g.
server, on the network from a failed condition. The operation begins in
a manner similar to that described and discussed above in connection
with FIG. 4A with the issuance of an I/O request by the application
module 196. That request is passed to the command processing module
192. Since the I/O request is destined for an external resource, the
path to the resource needs to be determined. The request is therefore
passed to the resource management module 186 and to the name driver
module 194 to obtain the path. The command processing module 192 passes
the request with path information to fail-over module 188 for further
processing. Fail-over module 188 then calls the redirector module 184
to send the I/O request via the path obtained from the name driver. If
fail-over module 188 determines there is a failure it calls the name
driver module to provide an alternate path for the I/O operation and
the fail-over module 188 reissues the I/O command with the alternate
path to the redirector module 184. Data passing between the resource
and the application module 196 is passed via the redirector module 184.
Upon failure detection and redirecting by fail-over module 188, name
driver module 194 marks the path as failed. Periodically name driver
module 194 checks the network for the valid presence of the failed
paths and if good, once again marks them failed-back or valid so that
they may once again be used in the future if necessary.



